library(tidyverse)
library(moderndive)
library(gridExtra)
What proportion of the balls are red? We could count all of them (take a census), but that would be a long process.
MD Figure 7.1: A bowl filled with red and white balls.
Instead of counting the entire population (census), we can select a random sample to use as an estimate.
MD Figure 7.3: Removing 50 balls from the bowl.
Assuming there are more than 50 red balls and 50 white balls, when we take a random sample of 50 balls from the bowl we could get anywhere from 0 to 50 red balls (or a proportion of 0.00 = 0% to 1.00 = 100%). In other words, all proportions are possible. But are they all equally probable? Why or why not?
ANSWER:
We cannot physically replicate this experiment, but we can do so virtually. The moderndive package contains a virtual object bowl that represents the physical object. Scoop one virtual_shovel of balls by running the following code.
virtual_shovel <- bowl %>%
rep_sample_n(size = 50, reps = 1)
The rep_sample_n() function is in package the moderndive.
rep_sample_n(tbl, size, replace = FALSE, reps = 1, prob = NULL)
Defaults are sampling without replacement, equal probability of being selected, and one repetition of the chosen size.
The proportion of red balls in your sample is an estimate of the proportion of red balls in the bowl.
virtual_shovel %>%
mutate(is_red = (color == "red")) %>%
summarize(num_red = sum(is_red),
prop_red = mean(is_red))
# A tibble: 1 x 3
replicate num_red prop_red
<int> <int> <dbl>
1 1 15 0.3
ggplot(virtual_shovel, aes(x = color)) +
geom_bar(color = "black", fill = "skyblue")
Note that the code is slightly different than the code found in Modern Dive, Chapter 7. This code is algebraically equivalent, but automatically works for sample sizes other than 50.
What if we wanted to calculate the count and proportion of white balls in the sample? How would you adapt the code?
# code it here!
In the physical scenario described in Modern Dive, there were 33 students in the class, each of whom drew one sample from the physical bowl. Their results are shown below.
class_plot <- ggplot(tactile_prop_red, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.5, color = "white", fill = "skyblue") +
scale_x_continuous(breaks = seq(from = 0, to = 1, by = 0.10),
limits = c(0, 1)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
labs(title = "Distribution of 33 Sample Proportions of Red Balls",
subtitle = "tactile sampling",
x = "Proportion of Balls That Were Red (sample size n = 50)",
y = "Frequency") +
theme_linedraw()
class_plot
What seems like the most probable value for \(p\) = proportion of red balls in the bowl, based on the plot of 33 samples? Why?
(We could calculate it directly, but in most real-life scenarios we cannot. Exploring cases where the solution is known is how we develop reliable methods for unknown cases.)
Thanks to the power of computers, you can replicate the class sampling process. Set the reps to 33 so you get one sample of size 50 for each virtual student.
virtual_samples <- bowl %>%
rep_sample_n(size = 50, reps = 33)
Calculate the count and proportion of red balls in each sample (replicate).
virtual_prop_red <- virtual_samples %>%
group_by(replicate) %>%
mutate(is_red = (color == "red")) %>%
summarize(num_red = sum(is_red),
prop_red = mean(is_red))
virtual_prop_red <- virtual_samples %>%
group_by(replicate) %>%
mutate(is_red = (color == "red")) %>%
summarize(num_red = sum(is_red),
prop_red = mean(is_red))
virtual_prop_red %>% head(n = 8) # I only have room for a few rows
# A tibble: 8 x 3
replicate num_red prop_red
<int> <int> <dbl>
1 1 22 0.44
2 2 17 0.34
3 3 17 0.34
4 4 18 0.36
5 5 17 0.34
6 6 19 0.38
7 7 16 0.32
8 8 18 0.36
Tactile results will always be the same. Virtual results will differ unless we set.seed() to control our pseudo-random stream.
sim_plot <- ggplot(virtual_prop_red, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05,
boundary = 0.5,
color = "white",
fill = "skyblue") +
scale_x_continuous(breaks = seq(from = 0, to = 1, by = 0.10),
limits = c(0, 1)) +
scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
labs(title = "Distribution of 33 Sample Proportions of Red Balls",
subtitle = "virtual sampling",
x = "Proportion of Balls That Were Red (sample size n = 50)",
y = "Frequency") +
theme_linedraw()
grid.arrange(class_plot, sim_plot)
We need to understand the sampling distribution of the statistic we are using to estimate a population value. What is the shape, center, and spread (variability)?
Usually we characterize center and spread using mean and SD. Can you tell the shape from the plot?
virtual_prop_red %>%
summarize(mean_props = mean(prop_red),
sd_props = sd(prop_red))
# A tibble: 1 x 2
mean_props sd_props
<dbl> <dbl>
1 0.376 0.0629
Our 33 random samples are not very many if we are trying to understand the true variability that can occur from sample to sample. Let’s take more samples!
Adapt the code to take 1000 samples of size 50. Summarize the count and proportion of red balls in each sample and save it as the object virtual_prop_red_1000.
virtual_prop_red_1000 <- bowl %>%
rep_sample_n(size = 50, reps = 1000) %>%
group_by(replicate) %>%
mutate(is_red = (color == "red")) %>%
summarize(num_red = sum(is_red),
prop_red = mean(is_red))
virtual_prop_red_1000 %>%
summarize(mean_props = mean(prop_red),
sd_props = sd(prop_red))
# A tibble: 1 x 2
mean_props sd_props
<dbl> <dbl>
1 0.374 0.0680
What shape does your distribution of virtual samples appear to have (approximately)?
What shape does your distribution of virtual samples appear to have (approximately)?
TBA!!!
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gridExtra_2.3 moderndive_0.5.1 forcats_0.5.1 stringr_1.4.0
[5] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[9] tibble_3.1.0 ggplot2_3.3.3 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lubridate_1.7.10 formula.tools_1.7.1
[4] assertthat_0.2.1 digest_0.6.27 utf8_1.2.1
[7] R6_2.5.0 cellranger_1.1.0 backports_1.2.1
[10] reprex_1.0.0 evaluate_0.14 httr_1.4.2
[13] highr_0.8 pillar_1.5.1 rlang_0.4.10
[16] readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.3
[19] rmarkdown_2.7 labeling_0.4.2 munsell_0.5.0
[22] broom_0.7.5 compiler_3.6.0 modelr_0.1.8
[25] janitor_2.1.0 xfun_0.22 pkgconfig_2.0.3
[28] htmltools_0.5.1.1 tidyselect_1.1.0 fansi_0.4.2
[31] crayon_1.4.1 dbplyr_2.1.0 withr_2.4.1
[34] grid_3.6.0 jsonlite_1.7.2 gtable_0.3.0
[37] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1
[40] infer_0.5.4 scales_1.1.1 cli_2.3.1
[43] stringi_1.5.3 debugme_1.1.0 farver_2.1.0
[46] fs_1.5.0 snakecase_0.11.0 xml2_1.3.2
[49] bslib_0.2.4 ellipsis_0.3.1 generics_0.1.0
[52] vctrs_0.3.6 tools_3.6.0 glue_1.4.2
[55] hms_1.0.0 yaml_2.2.1 colorspace_2.0-0
[58] operator.tools_1.6.3 rvest_1.0.0 knitr_1.31
[61] haven_2.3.1 sass_0.3.1